Let us load the dataset from the file US_Accidents_Dec21_updated.csv into a data frame, df_us_acc. The file is too large to upload to GitHub, so we store it in a folder called us_accidents_dataset, located one folder above the project folder.

df_us_acc <- read.csv('../../us_accidents_dataset/US_Accidents_Dec21_updated.csv')
str(df_us_acc)
## 'data.frame':    2845342 obs. of  47 variables:
##  $ ID                   : chr  "A-1" "A-2" "A-3" "A-4" ...
##  $ Severity             : int  3 2 2 2 3 2 2 2 2 2 ...
##  $ Start_Time           : chr  "2016-02-08 00:37:08" "2016-02-08 05:56:20" "2016-02-08 06:15:39" "2016-02-08 06:51:45" ...
##  $ End_Time             : chr  "2016-02-08 06:37:08" "2016-02-08 11:56:20" "2016-02-08 12:15:39" "2016-02-08 12:51:45" ...
##  $ Start_Lat            : num  40.1 39.9 39.1 41.1 39.2 ...
##  $ Start_Lng            : num  -83.1 -84.1 -84.5 -81.5 -84.5 ...
##  $ End_Lat              : num  40.1 39.9 39.1 41.1 39.2 ...
##  $ End_Lng              : num  -83 -84 -84.5 -81.5 -84.5 ...
##  $ Distance.mi.         : num  3.23 0.747 0.055 0.123 0.5 ...
##  $ Description          : chr  "Between Sawmill Rd/Exit 20 and OH-315/Olentangy Riv Rd/Exit 22 - Accident." "At OH-4/OH-235/Exit 41 - Accident." "At I-71/US-50/Exit 1 - Accident." "At Dart Ave/Exit 21 - Accident." ...
##  $ Number               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Street               : chr  "Outerbelt E" "I-70 E" "I-75 S" "I-77 N" ...
##  $ Side                 : chr  "R" "R" "R" "R" ...
##  $ City                 : chr  "Dublin" "Dayton" "Cincinnati" "Akron" ...
##  $ County               : chr  "Franklin" "Montgomery" "Hamilton" "Summit" ...
##  $ State                : chr  "OH" "OH" "OH" "OH" ...
##  $ Zipcode              : chr  "43017" "45424" "45203" "44311" ...
##  $ Country              : chr  "US" "US" "US" "US" ...
##  $ Timezone             : chr  "US/Eastern" "US/Eastern" "US/Eastern" "US/Eastern" ...
##  $ Airport_Code         : chr  "KOSU" "KFFO" "KLUK" "KAKR" ...
##  $ Weather_Timestamp    : chr  "2016-02-08 00:53:00" "2016-02-08 05:58:00" "2016-02-08 05:53:00" "2016-02-08 06:54:00" ...
##  $ Temperature.F.       : num  42.1 36.9 36 39 37 35.6 33.8 33.1 39 32 ...
##  $ Wind_Chill.F.        : num  36.1 NA NA NA 29.8 29.2 NA 30 31.8 28.7 ...
##  $ Humidity...          : num  58 91 97 55 93 100 100 92 70 100 ...
##  $ Pressure.in.         : num  29.8 29.7 29.7 29.6 29.7 ...
##  $ Visibility.mi.       : num  10 10 10 10 10 10 3 0.5 10 0.5 ...
##  $ Wind_Direction       : chr  "SW" "Calm" "Calm" "Calm" ...
##  $ Wind_Speed.mph.      : num  10.4 NA NA NA 10.4 8.1 2.3 3.5 11.5 3.5 ...
##  $ Precipitation.in.    : num  0 0.02 0.02 NA 0.01 NA NA 0.08 NA 0.05 ...
##  $ Weather_Condition    : chr  "Light Rain" "Light Rain" "Overcast" "Overcast" ...
##  $ Amenity              : chr  "False" "False" "False" "False" ...
##  $ Bump                 : chr  "False" "False" "False" "False" ...
##  $ Crossing             : chr  "False" "False" "False" "False" ...
##  $ Give_Way             : chr  "False" "False" "False" "False" ...
##  $ Junction             : chr  "False" "False" "True" "False" ...
##  $ No_Exit              : chr  "False" "False" "False" "False" ...
##  $ Railway              : chr  "False" "False" "False" "False" ...
##  $ Roundabout           : chr  "False" "False" "False" "False" ...
##  $ Station              : chr  "False" "False" "False" "False" ...
##  $ Stop                 : chr  "False" "False" "False" "False" ...
##  $ Traffic_Calming      : chr  "False" "False" "False" "False" ...
##  $ Traffic_Signal       : chr  "False" "False" "False" "False" ...
##  $ Turning_Loop         : chr  "False" "False" "False" "False" ...
##  $ Sunrise_Sunset       : chr  "Night" "Night" "Night" "Night" ...
##  $ Civil_Twilight       : chr  "Night" "Night" "Night" "Night" ...
##  $ Nautical_Twilight    : chr  "Night" "Night" "Night" "Day" ...
##  $ Astronomical_Twilight: chr  "Night" "Night" "Day" "Day" ...

First, let’s check the percentage of NAs in each column of the dataset.

(colMeans(is.na(df_us_acc)))*100
##                    ID              Severity            Start_Time 
##              0.000000              0.000000              0.000000 
##              End_Time             Start_Lat             Start_Lng 
##              0.000000              0.000000              0.000000 
##               End_Lat               End_Lng          Distance.mi. 
##              0.000000              0.000000              0.000000 
##           Description                Number                Street 
##              0.000000             61.290031              0.000000 
##                  Side                  City                County 
##              0.000000              0.000000              0.000000 
##                 State               Zipcode               Country 
##              0.000000              0.000000              0.000000 
##              Timezone          Airport_Code     Weather_Timestamp 
##              0.000000              0.000000              0.000000 
##        Temperature.F.         Wind_Chill.F.           Humidity... 
##              2.434646             16.505678              2.568830 
##          Pressure.in.        Visibility.mi.        Wind_Direction 
##              2.080593              2.479350              0.000000 
##       Wind_Speed.mph.     Precipitation.in.     Weather_Condition 
##              5.550967             19.310789              0.000000 
##               Amenity                  Bump              Crossing 
##              0.000000              0.000000              0.000000 
##              Give_Way              Junction               No_Exit 
##              0.000000              0.000000              0.000000 
##               Railway            Roundabout               Station 
##              0.000000              0.000000              0.000000 
##                  Stop       Traffic_Calming        Traffic_Signal 
##              0.000000              0.000000              0.000000 
##          Turning_Loop        Sunrise_Sunset        Civil_Twilight 
##              0.000000              0.000000              0.000000 
##     Nautical_Twilight Astronomical_Twilight 
##              0.000000              0.000000

Here, the column Number has the highest share of NAs (about 61%), followed by Precipitation.in. (about 19%), Wind_Chill.F. (about 17%), and a few other columns. We do not need the column Number, so we drop it. We keep the remaining columns because they are part of our analysis. We also drop the Description column to speed up code execution.

df_us_acc <- subset(df_us_acc, select = -c(Number, Description))

The other columns have relatively few NAs, so we simply drop the records that contain them.

df_us_acc <- drop_na(df_us_acc)
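
Before dropping incomplete records, it can be useful to check how many rows will survive. The sketch below is only an illustration on a small made-up data frame (`toy` is hypothetical, not the accident data). It uses base R's `complete.cases()`, which flags the rows that contain no NA in any column.

```r
# Toy data frame standing in for df_us_acc (values are made up).
toy <- data.frame(
  Temperature.F.    = c(42.1, NA, 36.0, 39.0),
  Precipitation.in. = c(0.00, 0.02, NA, 0.01),
  State             = c("OH", "OH", "OH", "OH")
)

# Percentage of NAs per column, computed the same way as in the report.
na_pct <- colMeans(is.na(toy)) * 100

# TRUE for rows without any NA; these are the rows that remain after the drop.
keep <- complete.cases(toy)
toy_clean <- toy[keep, ]

# Share of rows retained after dropping incomplete records.
retained_pct <- mean(keep) * 100

print(na_pct)
print(retained_pct)
```

Comparing `retained_pct` with the per-column NA percentages shows how much data is lost when NAs in different columns fall on different rows.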

Next, we will extract the year from the Start_Time column to check how the data is distributed across years.

df_us_acc$year<-format(as.Date(df_us_acc$Start_Time, format="%Y-%m-%d"),"%Y")
ggplot(df_us_acc, aes(x = year, fill=year)) +
    geom_bar()

As the yearly distribution graph shows, the number of records varies from year to year, because the dataset has been updated from multiple data sources. We therefore decided to use the year 2021 as the subset for our analysis.

clean_acc21 <- subset(df_us_acc, year==2021)

Let’s extract the month and the hour from Start_Time so that we can look at the monthly and hourly distributions.

clean_acc21$Month<-as.numeric(format(as.Date(clean_acc21$Start_Time, format="%Y-%m-%d"),"%m"))
clean_acc21$Hour<-hour(clean_acc21$Start_Time)
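
The date parts can also be extracted with base R alone. The following is a minimal sketch on two made-up timestamps in the same layout as Start_Time. The variable names and the `tz = "UTC"` setting are assumptions made here for reproducibility and are not part of the original analysis.

```r
# Example timestamps in the same "YYYY-mm-dd HH:MM:SS" layout as Start_Time.
start_time <- c("2021-01-15 08:05:00", "2021-07-04 17:45:30")

# Parse once and reuse the parsed values. tz = "UTC" is an assumption that
# keeps the example reproducible; the dataset stores times without a zone.
parsed <- as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract each date part as an integer.
year  <- as.integer(format(parsed, "%Y"))
month <- as.integer(format(parsed, "%m"))
hour  <- as.integer(format(parsed, "%H"))

print(data.frame(start_time, year, month, hour))
```

Parsing the timestamp once avoids converting the same string column separately for each derived field.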

Now, we will check the Severity distribution in the data.

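# Severity is an integer, so convert it to a factor to get one fill color per level
ggplot(df_us_acc, aes(x = Severity, fill = factor(Severity))) +
    geom_bar() +
    labs(fill = "Severity")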

As the graph shows, the severity levels are imbalanced: accidents with a severe impact on traffic are far less common than less severe ones, which is consistent with the real world. We therefore decided to merge levels 1 and 2 into “Not Severe” and levels 3 and 4 into “Severe” to make our analysis more focused.

clean_acc21 <- clean_acc21 %>% 
  mutate(Is_severe = if_else(Severity == 1 | Severity ==2 , "Not Severe", "Severe"))
clean_acc21$Is_severe <- as.factor(clean_acc21$Is_severe)
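
A quick cross-tabulation is one way to confirm that a recoding like this maps every level as intended. The sketch below uses a short made-up severity vector; the variable names are hypothetical.

```r
# Made-up severity levels covering all four original values.
severity <- c(1L, 2L, 3L, 4L, 2L)

# Levels 1 and 2 become "Not Severe"; everything else becomes "Severe".
is_severe <- factor(
  ifelse(severity %in% c(1L, 2L), "Not Severe", "Severe"),
  levels = c("Not Severe", "Severe")
)

# Each original level should appear under exactly one of the two labels.
check <- table(severity, is_severe)
print(check)
```

Setting `levels` explicitly fixes the order of the two classes, which keeps later plots and tests consistent.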

For some initial EDA, we were curious to see how the data looks on a map, particularly for the DC area, where we currently live. The map below shows the accidents that took place in the DC area in 2021.

df_map<-dplyr::select(clean_acc21, State, Start_Lat, Start_Lng)
df_map_DC <- df_map %>% filter(State == "DC")
df_map_DC_sf <- st_as_sf(df_map_DC, coords = c("Start_Lng", "Start_Lat"), crs = 4326)
mapview(df_map_DC_sf, map.types = "Stamen.Toner",col.regions=("red"))

To answer our first SMART question, “Does weather affect the severity of traffic?”, we first checked the distributions of the numerical weather variables.

tempHist <- ggplot(clean_acc21, aes(x=Temperature.F.)) + geom_histogram(color="black", fill = "red")+
  ggtitle("Histogram of Temperature(F) for accidents")


windcHist <- ggplot(clean_acc21, aes(x=Wind_Chill.F.)) + geom_histogram(color="black", fill = "orange")+
  ggtitle("Histogram of Wind chill for accidents")


humidHist <- ggplot(clean_acc21, aes(x=Humidity...)) + geom_histogram(color="black", fill = "yellow")+
  ggtitle("Histogram of Humidity for accidents")


windsHist <- ggplot(clean_acc21, aes(x=Wind_Speed.mph.)) + geom_histogram(color="black", fill = "navy")+
  ggtitle("Histogram of Wind Speed for accidents")


pressHist <- ggplot(clean_acc21, aes(x=Pressure.in.)) + geom_histogram(color="black", fill = "green")+
  ggtitle("Histogram of Pressure for accidents")


visibHist <- ggplot(clean_acc21, aes(x=Visibility.mi.)) + geom_histogram(color="black", fill = "blue")+
  ggtitle("Histogram of Visibility for accidents")


precipHist <- ggplot(clean_acc21, aes(x=Precipitation.in.)) + geom_histogram(color="black", fill = "purple")+
  ggtitle("Histogram of Precipitation for accidents")


grid.arrange(tempHist, windcHist, humidHist, windsHist, pressHist, visibHist, precipHist, ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We plotted histograms of the numerical weather variables for the accidents. From these histograms, we found that Temperature, Wind Chill, and Humidity have left-skewed distributions.
The remaining variables (Wind Speed, Pressure, Visibility, and Precipitation) have means close to their medians, with a few outliers.

wooutlier_winds <- outlierKD2(clean_acc21, Wind_Speed.mph., rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

## Outliers identified: 24710 
## Proportion (%) of outliers: 1.8 
## Mean of the outliers: 24.14 
## Mean without removing outliers: 7.12 
## Mean if we remove outliers: 6.82 
## Outliers successfully removed
clean_acc21_woo <- outlierKD2(wooutlier_winds, Pressure.in., rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

## Outliers identified: 101201 
## Proportion (%) of outliers: 7.7 
## Mean of the outliers: 26.25 
## Mean without removing outliers: 29.41 
## Mean if we remove outliers: 29.65 
## Outliers successfully removed

We tried removing the outliers from Wind Speed and Pressure, and the generated plots show that both variables are closer to a normal distribution without them. However, we decided to keep the outliers, because extreme values are natural in weather variables when the data covers a whole year. In addition, the outliers do not affect the results of the t-tests.
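
To make the outlier step more concrete, here is a minimal sketch of the common boxplot rule (values beyond 1.5 times the IQR from the quartiles) in base R. It assumes that outlierKD2 flags outliers in a similar way, which is not confirmed by this report, and it runs on made-up wind speeds rather than the accident data.

```r
set.seed(1)

# Made-up wind speeds: a bell-shaped bulk plus three extreme values.
wind <- c(rnorm(200, mean = 7, sd = 2), 30, 35, 40)

# boxplot.stats() returns the points beyond the whiskers (default coef = 1.5).
outliers <- boxplot.stats(wind)$out
is_out <- wind %in% outliers

# Compare the mean with and without the flagged values.
summary_stats <- c(
  n_outliers   = sum(is_out),
  mean_all     = mean(wind),
  mean_without = mean(wind[!is_out])
)
print(round(summary_stats, 2))
```

Reporting both means side by side mirrors the summary printed above and shows how much the flagged values pull the average.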

plot1 <- ggplot(clean_acc21, aes(x = Is_severe, y=Temperature.F.)) + 
  geom_boxplot() +
  labs(title="Temperature by Severity", x="Severity", y = "Temperature(F)")

plot2 <- ggplot(clean_acc21, aes(x = Is_severe, y=Wind_Chill.F.)) + 
  geom_boxplot() +
  labs(title="Wind Chill by Severity", x="Severity", y = "Wind Chill")

plot3 <- ggplot(clean_acc21, aes(x = Is_severe, y=Wind_Speed.mph.)) + 
  geom_boxplot() +
  labs(title="Wind Speed by Severity", x="Severity", y = "Wind Speed")

plot4 <- ggplot(clean_acc21, aes(x = Is_severe, y=Humidity...)) + 
  geom_boxplot() +
  labs(title="Humidity by Severity", x="Severity", y = "Humidity")

plot5 <- ggplot(clean_acc21, aes(x = Is_severe, y=Pressure.in.)) + 
  geom_boxplot() +
  labs(title="Pressure by Severity", x="Severity", y = "Pressure")

plot6 <- ggplot(clean_acc21, aes(x = Is_severe, y=Visibility.mi.)) + 
  geom_boxplot() +
  labs(title="Visibility by Severity", x="Severity", y = "Visibility")

plot7 <- ggplot(clean_acc21, aes(x = Is_severe, y=Precipitation.in.)) + 
  geom_boxplot() +
  labs(title="Precipitation by Severity", x="Severity", y = "Precipitation")

grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, ncol=3)

We then compared the distribution of each weather variable between the two severity levels, ‘Severe’ and ‘Not Severe’.

For Temperature, Wind Chill, and Humidity, the range of the data and the outliers differ between the severity levels.

The remaining variables (Wind Speed, Pressure, Visibility, and Precipitation) still span a narrow range, but the boxplots make it easier to compare their distributions across the two severity levels.

box_clean2_severe = subset(clean_acc21, Is_severe == 'Severe')
box_clean2_notsevere = subset(clean_acc21, Is_severe == 'Not Severe')
print("Temperature.F")
## [1] "Temperature.F"
t.test(box_clean2_severe$Temperature.F., box_clean2_notsevere$Temperature.F.)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Temperature.F. and box_clean2_notsevere$Temperature.F.
## t = -43.428, df = 22832, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.875845 -5.368348
## sample estimates:
## mean of x mean of y 
##  57.73754  63.35964
print("Wind_Chill.F.")
## [1] "Wind_Chill.F."
t.test(box_clean2_severe$Wind_Chill.F., box_clean2_notsevere$Wind_Chill.F.)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Wind_Chill.F. and box_clean2_notsevere$Wind_Chill.F.
## t = -42.876, df = 22802, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.518654 -5.948717
## sample estimates:
## mean of x mean of y 
##  56.11613  62.34982
print("Humidity...")
## [1] "Humidity..."
t.test(box_clean2_severe$Humidity..., box_clean2_notsevere$Humidity...)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Humidity... and box_clean2_notsevere$Humidity...
## t = 17.706, df = 22858, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.380009 2.972538
## sample estimates:
## mean of x mean of y 
##  67.31599  64.63971
print("Wind_Speed.mph.")
## [1] "Wind_Speed.mph."
t.test(box_clean2_severe$Wind_Speed.mph., box_clean2_notsevere$Wind_Speed.mph.)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Wind_Speed.mph. and box_clean2_notsevere$Wind_Speed.mph.
## t = -8.0218, df = 22802, p-value = 1.092e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3806752 -0.2311737
## sample estimates:
## mean of x mean of y 
##  6.815942  7.121866
print("Visibility.mi.")
## [1] "Visibility.mi."
t.test(box_clean2_severe$Visibility.mi., box_clean2_notsevere$Visibility.mi.)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Visibility.mi. and box_clean2_notsevere$Visibility.mi.
## t = -1.1754, df = 22823, p-value = 0.2398
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05474086  0.01369841
## sample estimates:
## mean of x mean of y 
##  9.053624  9.074146
print("Pressure.in.")
## [1] "Pressure.in."
t.test(box_clean2_severe$Pressure.in., box_clean2_notsevere$Pressure.in.)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Pressure.in. and box_clean2_notsevere$Pressure.in.
## t = -30.44, df = 22554, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3006229 -0.2642501
## sample estimates:
## mean of x mean of y 
##  29.12821  29.41065
print("Precipitation.in.")
## [1] "Precipitation.in."
t.test(box_clean2_severe$Precipitation.in., box_clean2_notsevere$Precipitation.in.)
## 
##  Welch Two Sample t-test
## 
## data:  box_clean2_severe$Precipitation.in. and box_clean2_notsevere$Precipitation.in.
## t = -2.8336, df = 23267, p-value = 0.004606
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0014187715 -0.0002585427
## sample estimates:
##   mean of x   mean of y 
## 0.004886261 0.005724918

First, we split the data into two subsets by severity level to check whether the means of the weather variables are the same for both levels. We then performed a two-sample t-test for each weather variable, since these variables are quantitative and the severity levels give us two samples.

H0: The mean of Temperature/Wind Chill/Humidity/Wind Speed/Pressure/Visibility/Precipitation is the same for both severity levels.
H1: The mean of Temperature/Wind Chill/Humidity/Wind Speed/Pressure/Visibility/Precipitation is NOT the same for both severity levels.

The p-values from all tests except the one for Visibility are lower than 0.05, so we reject H0 for every weather variable except Visibility. In other words, the means of all weather variables except Visibility differ between the two severity levels.

From these t-tests, we conclude that the numerical weather variables Temperature, Wind Chill, Humidity, Pressure, Wind Speed, and Precipitation affect the severity of traffic. Note that with samples this large, even small differences in means (less than 0.001 in. for Precipitation) are statistically significant.
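
The repeated t-test calls can be written as a loop that also reports an effect size next to each p-value. The sketch below runs on simulated data, not the accident data. The means, standard deviations, and the simple equal-sample-size form of Cohen's d are assumptions chosen for illustration.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for clean_acc21 with two weather variables.
toy <- data.frame(
  Is_severe      = factor(rep(c("Not Severe", "Severe"), each = n)),
  Temperature.F. = c(rnorm(n, 63, 18), rnorm(n, 58, 18)),
  Visibility.mi. = c(rnorm(n, 9, 2), rnorm(n, 9, 2))
)

weather_vars <- c("Temperature.F.", "Visibility.mi.")

# One row per variable: mean difference, Welch p-value, and Cohen's d.
results <- do.call(rbind, lapply(weather_vars, function(v) {
  x <- toy[toy$Is_severe == "Severe", v]
  y <- toy[toy$Is_severe == "Not Severe", v]
  tt <- t.test(x, y)  # Welch test by default (var.equal = FALSE)
  pooled_sd <- sqrt((var(x) + var(y)) / 2)  # valid for equal group sizes
  data.frame(
    variable  = v,
    mean_diff = mean(x) - mean(y),
    p_value   = tt$p.value,
    cohens_d  = (mean(x) - mean(y)) / pooled_sd
  )
}))

print(results)
```

Looking at the effect size together with the p-value helps separate differences that are large in practice from those that are significant mainly because the sample is large.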

However, the numerical variables are not the only weather variables. Our dataset also contains categorical weather variables, such as Wind Direction and Weather Condition.

# wind direction with severity
ggplot(clean_acc21, aes(Wind_Direction, after_stat(prop), group = Is_severe)) +
  geom_bar(aes(fill = Is_severe)) +
  #scale_y_continuous(labels = percent) +
  labs(x = "Wind Direction",
       y = "Proportion",
       title = "Wind direction by Severity") +
  theme(text = element_text(size=8)) 

We made a bar plot to see the distribution of Severity by wind direction. The plot shows a similar distribution for both severity levels across most wind directions. We therefore infer that wind direction has little effect on the severity of traffic, and we decided not to perform any statistical analysis on the Wind Direction variable.
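
A table of row-wise proportions is one way to back up a visual comparison like this with numbers. The sketch below uses a tiny made-up data frame (`toy` is hypothetical) and base R's `prop.table()` to compute the share of each severity level within each wind direction.

```r
# Made-up observations: wind direction and severity label per accident.
toy <- data.frame(
  Wind_Direction = c("N", "N", "N", "N", "S", "S", "S", "S", "CALM", "CALM"),
  Is_severe = c("Severe", "Not Severe", "Not Severe", "Not Severe",
                "Severe", "Not Severe", "Not Severe", "Not Severe",
                "Not Severe", "Not Severe")
)

# Counts per (direction, severity) pair.
tab <- table(toy$Wind_Direction, toy$Is_severe)

# margin = 1 normalizes each row, so every direction sums to 1.
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 2))
```

If the "Severe" column is close to constant across rows, the severity mix does not vary much with wind direction.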

# Weather Condition percentage
unique(clean_acc21$Weather_Condition)
##  [1] "Fair"                           "Fog"                           
##  [3] "Mostly Cloudy"                  "Cloudy"                        
##  [5] "Partly Cloudy"                  ""                              
##  [7] "Light Rain"                     "Heavy T-Storm / Windy"         
##  [9] "Light Snow"                     "Rain"                          
## [11] "T-Storm"                        "Haze"                          
## [13] "Fair / Windy"                   "Smoke"                         
## [15] "Cloudy / Windy"                 "Snow"                          
## [17] "Heavy Snow"                     "Thunder"                       
## [19] "Thunder in the Vicinity"        "N/A Precipitation"             
## [21] "Heavy Rain"                     "Thunder / Windy"               
## [23] "Heavy T-Storm"                  "Mostly Cloudy / Windy"         
## [25] "Shallow Fog"                    "Mist"                          
## [27] "Partly Cloudy / Windy"          "Snow / Windy"                  
## [29] "Light Rain with Thunder"        "Light Snow / Windy"            
## [31] "Rain / Windy"                   "Wintry Mix"                    
## [33] "Heavy Rain / Windy"             "Drizzle"                       
## [35] "Light Drizzle"                  "Light Rain / Windy"            
## [37] "Haze / Windy"                   "Light Snow and Sleet"          
## [39] "Showers in the Vicinity"        "T-Storm / Windy"               
## [41] "Patches of Fog"                 "Light Freezing Rain"           
## [43] "Sand / Dust Whirlwinds"         "Light Freezing Drizzle"        
## [45] "Fog / Windy"                    "Heavy Drizzle"                 
## [47] "Light Snow with Thunder"        "Blowing Dust / Windy"          
## [49] "Rain Shower"                    "Heavy Snow / Windy"            
## [51] "Blowing Snow / Windy"           "Light Rain Shower"             
## [53] "Snow and Sleet"                 "Drizzle and Fog"               
## [55] "Light Sleet"                    "Drizzle / Windy"               
## [57] "Light Snow Shower"              "Snow and Thunder / Windy"      
## [59] "Light Sleet / Windy"            "Smoke / Windy"                 
## [61] "Blowing Dust"                   "Wintry Mix / Windy"            
## [63] "Blowing Snow"                   "Widespread Dust / Windy"       
## [65] "Light Drizzle / Windy"          "Squalls"                       
## [67] "Tornado"                        "Squalls / Windy"               
## [69] "Hail"                           "Blowing Snow Nearby"           
## [71] "Partial Fog"                    "Widespread Dust"               
## [73] "Sand / Windy"                   "Thunder / Wintry Mix"          
## [75] "Light Freezing Rain / Windy"    "Light Snow and Sleet / Windy"  
## [77] "Heavy Rain Shower / Windy"      "Small Hail"                    
## [79] "Sand / Dust Whirlwinds / Windy" "Light Rain Shower / Windy"     
## [81] "Thunder and Hail"               "Freezing Rain"                 
## [83] "Heavy Sleet"                    "Snow Grains"                   
## [85] "Sleet"                          "Freezing Drizzle"              
## [87] "Snow and Sleet / Windy"         "Freezing Rain / Windy"         
## [89] "Heavy Freezing Drizzle"         "Heavy Freezing Rain"           
## [91] "Blowing Sand"
WC <- clean_acc21 %>%
  group_by(Weather_Condition) %>%
  summarise(cnt = n()) %>%
  mutate(freq = (round(cnt/sum(cnt), 3))*100 )%>%
  arrange(desc(freq)) %>%
  filter(freq > 1)
WC
## # A tibble: 8 × 3
##   Weather_Condition    cnt  freq
##   <chr>              <int> <dbl>
## 1 Fair              686869  48.3
## 2 Cloudy            207687  14.6
## 3 Mostly Cloudy     186185  13.1
## 4 Partly Cloudy     124698   8.8
## 5 Light Rain         61960   4.4
## 6 Fog                24213   1.7
## 7 Light Snow         23197   1.6
## 8 Haze               19755   1.4
WC %>%
    ggplot() +
    geom_col(mapping = aes(x=reorder(Weather_Condition, -freq), y=freq, fill = Weather_Condition)) +
    labs(x = "Weather Condition", y="%", title ="Top 8 Weather Conditions with accidents") +
    theme(text = element_text(size=7))

Next, we looked at the Weather Condition variable.
The dataset contains many distinct weather conditions, so we made a bar plot of the top 8 conditions under which accidents happened: ‘Fair’, ‘Cloudy’, ‘Mostly Cloudy’, ‘Partly Cloudy’, ‘Light Rain’, ‘Fog’, ‘Light Snow’, and ‘Haze’.

The most frequent weather condition was ‘Fair’. However, the Weather Condition variable is split into fine-grained categories (for example, ‘Cloudy’, ‘Mostly Cloudy’, and ‘Partly Cloudy’), so we decided to run a chi-squared test on all weather conditions and severity.

# Try the Chi-Squared Test on all Weather_Conditions and Severity
WCtable <- table(clean_acc21$Weather_Condition , clean_acc21$Severity)
xkabledply(WCtable, title = "Severity by Weather Conditions")
Severity by Weather Conditions
Weather_Condition               Severity 2  Severity 4
(blank)                         3616        84
Blowing Dust                    68          0
Blowing Dust / Windy            67          5
Blowing Sand                    1           0
Blowing Snow                    20          0
Blowing Snow / Windy            29          0
Blowing Snow Nearby             2           0
Cloudy                          203765      3922
Cloudy / Windy                  3837        68
Drizzle                         742         23
Drizzle / Windy                 5           0
Drizzle and Fog                 93          2
Fair                            676490      10379
Fair / Windy                    8755        277
Fog                             23772       441
Fog / Windy                     169         3
Freezing Drizzle                10          0
Freezing Rain                   25          1
Freezing Rain / Windy           1           0
Hail                            2           0
Haze                            19623       132
Haze / Windy                    315         19
Heavy Drizzle                   60          0
Heavy Freezing Drizzle          1           0
Heavy Freezing Rain             1           0
Heavy Rain                      6051        88
Heavy Rain / Windy              477         3
Heavy Rain Shower / Windy       1           0
Heavy Sleet                     13          0
Heavy Snow                      634         26
Heavy Snow / Windy              101         5
Heavy T-Storm                   2883        37
Heavy T-Storm / Windy           310         4
Light Drizzle                   3108        69
Light Drizzle / Windy           45          3
Light Freezing Drizzle          155         4
Light Freezing Rain             193         11
Light Freezing Rain / Windy     20          4
Light Rain                      60990       970
Light Rain / Windy              1779        34
Light Rain Shower               56          0
Light Rain Shower / Windy       1           0
Light Rain with Thunder         4139        57
Light Sleet                     55          2
Light Sleet / Windy             5           0
Light Snow                      22689       508
Light Snow / Windy              1172        27
Light Snow and Sleet            40          0
Light Snow and Sleet / Windy    13          0
Light Snow Shower               24          0
Light Snow with Thunder         7           0
Mist                            306         4
Mostly Cloudy                   183719      2466
Mostly Cloudy / Windy           3155        53
N/A Precipitation               646         29
Partial Fog                     1           0
Partly Cloudy                   122913      1785
Partly Cloudy / Windy           1974        30
Patches of Fog                  480         6
Rain                            14270       196
Rain / Windy                    656         7
Rain Shower                     11          0
Sand / Dust Whirlwinds          8           0
Sand / Dust Whirlwinds / Windy  1           0
Sand / Windy                    2           0
Shallow Fog                     512         5
Showers in the Vicinity         314         5
Sleet                           31          1
Small Hail                      22          1
Smoke                           3977        16
Smoke / Windy                   19          0
Snow                            2516        64
Snow / Windy                    195         4
Snow and Sleet                  85          7
Snow and Sleet / Windy          7           0
Snow and Thunder / Windy        2           0
Snow Grains                     5           0
Squalls                         5           0
Squalls / Windy                 5           2
T-Storm                         4989        61
T-Storm / Windy                 194         0
Thunder                         4958        36
Thunder / Windy                 178         1
Thunder / Wintry Mix            5           1
Thunder and Hail                1           0
Thunder in the Vicinity         5572        76
Tornado                         7           0
Widespread Dust                 8           0
Widespread Dust / Windy         18          0
Wintry Mix                      2721        92
Wintry Mix / Windy              47          0
chitest = chisq.test(WCtable)
## Warning in chisq.test(WCtable): Chi-squared approximation may be incorrect
chitest
## 
##  Pearson's Chi-squared test
## 
## data:  WCtable
## X-squared = 984.98, df = 90, p-value < 2.2e-16

To determine whether the severity of traffic depends on Weather Condition, we ran a chi-squared test. Weather Condition is a categorical variable, and we kept Severity in its original form, whose discrete levels can be treated as categories, so a chi-squared test of independence lets us check the dependency between the two. R warned that the chi-squared approximation may be incorrect, which can happen when some cells have very small expected counts, as is the case for the rare weather conditions here.

H0: Severity and weather conditions are independent.
H1: Severity and weather conditions are NOT independent.
As we can see, the p-value from the chi-squared test is lower than 0.05, so we reject H0. This means that the severity of traffic and weather conditions are dependent.
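
One possible way to handle sparse cells in a test like this is to merge rare conditions into broader groups and to use a simulated p-value. The sketch below is an illustration only: the data is simulated, and `group_weather()` with its grouping rules is a hypothetical helper, not part of the original analysis. The `simulate.p.value` and `B` arguments are standard options of `chisq.test()`.

```r
set.seed(7)

# Simulated stand-in for the Weather_Condition and Severity columns.
conditions <- c("Fair", "Cloudy", "Light Rain", "Heavy Rain / Windy",
                "Light Snow", "Fog", "Tornado")
toy <- data.frame(
  Weather_Condition = sample(conditions, 2000, replace = TRUE,
                             prob = c(0.5, 0.2, 0.12, 0.05, 0.08, 0.049, 0.001)),
  Severity = sample(c(2, 4), 2000, replace = TRUE, prob = c(0.95, 0.05))
)

# Hypothetical grouping rules; later rules overwrite earlier ones.
group_weather <- function(x) {
  out <- rep("Other", length(x))
  out[grepl("Fair|Cloud", x)] <- "Clear/Cloudy"
  out[grepl("Rain|Drizzle|Storm|Thunder", x)] <- "Rain"
  out[grepl("Snow|Sleet|Wintry|Freezing", x)] <- "Snow/Ice"
  out[grepl("Fog|Haze|Mist|Smoke", x)] <- "Low visibility"
  out
}

toy$Weather_Group <- group_weather(toy$Weather_Condition)
grouped_table <- table(toy$Weather_Group, toy$Severity)

# Monte Carlo p-value; it does not rely on the large-sample approximation.
sim <- chisq.test(grouped_table, simulate.p.value = TRUE, B = 2000)
print(grouped_table)
print(sim)
```

Fewer, larger groups give higher expected counts per cell, and the simulated p-value avoids the approximation that triggers the warning.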

From these analyses of the numerical and categorical weather variables, we can answer our SMART question about the impact of weather on the severity of traffic. We concluded that the numerical weather variables, except for Visibility, affect the severity of traffic. Wind direction, however, has little effect, because the severity distribution is similar across directions. For the Weather Condition variable, we observe that the weather conditions at the time of an accident affect the severity of traffic.

SMART Question 2: Do Nearby Road Elements affect the severity of traffic?

To determine whether nearby road components have an impact on the severity of the traffic, we have conducted the exploratory data analysis listed below.

clean_acc21 %>% 
      group_by(Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop)%>%
    summarise(percentage = n()/nrow(clean_acc21) *100 )%>%
    arrange(-percentage) %>%
    filter(percentage > 1) -> accidents_per_roadelement
## `summarise()` has grouped output by 'Amenity', 'Bump', 'Crossing', 'Give_Way',
## 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
## 'Traffic_Calming', 'Traffic_Signal'. You can override using the `.groups`
## argument.
head(accidents_per_roadelement)
## # A tibble: 6 × 14
## # Groups:   Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway,
## #   Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal [6]
##   Amenity Bump  Crossing Give_Way Junction No_Exit Railway Round…¹ Station Stop 
##   <chr>   <chr> <chr>    <chr>    <chr>    <chr>   <chr>   <chr>   <chr>   <chr>
## 1 False   False False    False    False    False   False   False   False   False
## 2 False   False False    False    True     False   False   False   False   False
## 3 False   False False    False    False    False   False   False   False   False
## 4 False   False True     False    False    False   False   False   False   False
## 5 False   False True     False    False    False   False   False   False   False
## 6 False   False False    False    False    False   False   False   True    False
## # … with 4 more variables: Traffic_Calming <chr>, Traffic_Signal <chr>,
## #   Turning_Loop <chr>, percentage <dbl>, and abbreviated variable name
## #   ¹​Roundabout

Here, we grouped the accidents by the combination of nearby road features (amenity, bump, crossing, give-way, junction, no-exit, railway, roundabout, station, stop, traffic calming, traffic signal, and turning loop) and kept the combinations that account for more than 1% of accidents.

acc_road_element_per <- tibble(c("None", "Junction","Crossing", "Traffic signal", "Crossing and traffic signal", "Station", "Stop" ), pull(accidents_per_roadelement, percentage), .name_repair = ~ c("road_elements", "percentage"))
acc_road_element_per
## # A tibble: 7 × 2
##   road_elements               percentage
##   <chr>                            <dbl>
## 1 None                             77.2 
## 2 Junction                          6.53
## 3 Crossing                          4.11
## 4 Traffic signal                    2.94
## 5 Crossing and traffic signal       2.46
## 6 Station                           1.54
## 7 Stop                              1.38

The tibble above lists the road-element combinations that each account for more than 1% of accidents, along with their share of all accidents in the 2021 subset.

The graph above shows the share of accidents that happened near each of these road features.

accidents_per_roadelement<-clean_acc21 %>% 
      select(Is_severe,Junction,Crossing,Stop,Station,Traffic_Signal)


head(accidents_per_roadelement)
##       Is_severe Junction Crossing  Stop Station Traffic_Signal
## 8893 Not Severe    False    False False   False          False
## 8894 Not Severe    False    False False   False          False
## 8895 Not Severe    False    False False    True          False
## 8896 Not Severe    False    False False   False          False
## 8897 Not Severe    False    False False   False          False
## 8898 Not Severe    False    False False   False          False

Here, we created a data frame with the severity label (Is_severe) and the columns for junction, crossing, stop, station, and traffic signal.

Junction_data<- subset(accidents_per_roadelement, Junction=="True",
select=c(Is_severe,Junction))
head(Junction_data)
##       Is_severe Junction
## 8907 Not Severe     True
## 8913 Not Severe     True
## 8923 Not Severe     True
## 8931 Not Severe     True
## 8934 Not Severe     True
## 8988 Not Severe     True
Crossing_data<- subset(accidents_per_roadelement, Crossing=="True",
select=c(Is_severe,Crossing))
head(Crossing_data)
##       Is_severe Crossing
## 8939 Not Severe     True
## 8940 Not Severe     True
## 8983 Not Severe     True
## 8989 Not Severe     True
## 8992 Not Severe     True
## 8996 Not Severe     True
Stop_data<- subset(accidents_per_roadelement, Stop=="True",
select=c(Is_severe,Stop))
head(Stop_data)
##       Is_severe Stop
## 8939 Not Severe True
## 8964 Not Severe True
## 9016 Not Severe True
## 9123 Not Severe True
## 9133 Not Severe True
## 9135 Not Severe True
Station_data<- subset(accidents_per_roadelement, Station=="True",
select=c(Is_severe,Station))
head(Station_data)
##       Is_severe Station
## 8895 Not Severe    True
## 8922 Not Severe    True
## 8939 Not Severe    True
## 8959 Not Severe    True
## 8974 Not Severe    True
## 8985 Not Severe    True
Traffic_Signal_data<- subset(accidents_per_roadelement, Traffic_Signal=="True",
select=c(Is_severe,Traffic_Signal))
head(Traffic_Signal_data)
##       Is_severe Traffic_Signal
## 8930 Not Severe           True
## 8940 Not Severe           True
## 8952 Not Severe           True
## 8970 Not Severe           True
## 8983 Not Severe           True
## 8995 Not Severe           True

Here, the data frame is split into one subset per road element. Each subset keeps only the accidents where that element was present, together with their severity label.
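The same per-element breakdown can also be obtained as a single table rather than five separate data frames. The sketch below is only an illustration on synthetic data: the `toy` data frame, the `flag` helper, and the probabilities are assumptions made up for the example, and only the column names mirror `accidents_per_roadelement`.

```r
# Synthetic stand-in for accidents_per_roadelement (illustrative values only)
set.seed(7)
n <- 1000
flag <- function(p) sample(c("True", "False"), n, replace = TRUE, prob = c(p, 1 - p))

toy <- data.frame(
  Is_severe = factor(
    sample(c("Not Severe", "Severe"), n, replace = TRUE, prob = c(0.95, 0.05)),
    levels = c("Not Severe", "Severe")
  ),
  Junction = flag(0.10),
  Crossing = flag(0.08),
  Stop = flag(0.02),
  Station = flag(0.03),
  Traffic_Signal = flag(0.09)
)

elements <- c("Junction", "Crossing", "Stop", "Station", "Traffic_Signal")

# For each road element, count severity labels among accidents where it is present
severity_by_element <- sapply(elements, function(el) {
  table(toy$Is_severe[toy[[el]] == "True"])
})

severity_by_element
```

Each column of `severity_by_element` corresponds to one road element and each row to a severity label, so the counts behind the five bar charts can be compared side by side.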

ggplot(data=Junction_data, aes(x=Is_severe, y=Junction, fill=Is_severe)) +
  geom_bar(stat="identity")+
  labs(title="Accidents By Junction", x="Severity", y = "Junction")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data=Crossing_data, aes(x=Is_severe, y=Crossing, fill=Is_severe)) +
  geom_bar(stat="identity")+
  labs(title="Accidents By Crossing", x="Severity", y = "Crossing")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data=Stop_data, aes(x=Is_severe, y=Stop, fill=Is_severe)) +
  geom_bar(stat="identity")+
  labs(title="Accidents By Stop", x="Severity", y = "Stop")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data=Station_data, aes(x=Is_severe, y=Station, fill=Is_severe)) +
  geom_bar(stat="identity")+
  labs(title="Accidents By Station", x="Severity", y = "Station")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data=Traffic_Signal_data, aes(x=Is_severe, y=Traffic_Signal, fill=Is_severe)) +
  geom_bar(stat="identity")+
  labs(title="Accidents By Traffic Signal", x="Severity", y = "Traffic Signal")+
  theme(plot.title = element_text(hjust = 0.5))

The graphs above show how accidents near each road element (junctions, crossings, stops, stations, and traffic signals) split between severe and not severe.

# Recode the severity label and each road-element flag as 0/1 indicators
accidents_per_roadelement <- accidents_per_roadelement %>% 
  mutate(severe_num = if_else(Is_severe == "Severe", 1, 0),
         Crossing_num = if_else(Crossing == "True", 1, 0),
         Stop_num = if_else(Stop == "True", 1, 0),
         Station_num = if_else(Station == "True", 1, 0),
         Traffic_Signal_num = if_else(Traffic_Signal == "True", 1, 0),
         Junction_num = if_else(Junction == "True", 1, 0))

To run the ANOVA test, we convert the categorical columns above into numeric 0/1 indicators.

Junction_anova = aov(Junction_num~severe_num, data=accidents_per_roadelement)
summary(Junction_anova)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## severe_num        1     18  18.066   288.4 <2e-16 ***
## Residuals   1423119  89156   0.063                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Crossing_anova = aov(Crossing_num~severe_num, data=accidents_per_roadelement)
summary(Crossing_anova)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## severe_num        1     12  12.333   174.5 <2e-16 ***
## Residuals   1423119 100567   0.071                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stop_anova = aov(Stop_num~severe_num, data=accidents_per_roadelement)
summary(Stop_anova)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## severe_num        1      1  0.6189   30.45 3.42e-08 ***
## Residuals   1423119  28921  0.0203                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Station_anova = aov(Station_num~severe_num, data=accidents_per_roadelement)
summary(Station_anova)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## severe_num        1      8   7.844   276.9 <2e-16 ***
## Residuals   1423119  40314   0.028                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Traffic_signal_anova = aov(Traffic_Signal_num~severe_num, data=accidents_per_roadelement)
summary(Traffic_signal_anova)
##                  Df Sum Sq Mean Sq F value  Pr(>F)   
## severe_num        1      1  0.5787   8.023 0.00462 **
## Residuals   1423119 102642  0.0721                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We run a one-way ANOVA for each road element, treating the 0/1 indicators as numeric. The models above are fitted with the road-element indicator as the response and severity as the predictor; because each model has a single predictor, the F test is the same whichever of the two variables is the response.

H0: The mean severity of traffic is the same across nearby road elements (Junction, Crossing, Stop, Station, Traffic Signal).

H1: The mean severity of traffic is not the same across nearby road elements (Junction, Crossing, Stop, Station, Traffic Signal).

The p-value from the ANOVA test is below 0.05 for every road-element variable, so we can reject H0. We therefore conclude that the nearby road elements Junction, Crossing, Stop, Station, and Traffic Signal affect the severity of traffic.
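Because both variables in each ANOVA model above are 0/1 indicators, the test can be cross-checked in two ways: swapping response and predictor, and running a chi-squared test on the 2×2 table. The sketch below illustrates this on synthetic data; the `toy` data frame, the `f_stat` helper, and the probabilities are assumptions for the example and are not taken from the accident dataset.

```r
# Synthetic 0/1 indicators (illustrative values only)
set.seed(42)
n <- 5000
junction_num <- rbinom(n, 1, 0.08)
severe_num <- rbinom(n, 1, ifelse(junction_num == 1, 0.10, 0.05))
toy <- data.frame(junction_num = junction_num, severe_num = severe_num)

# Hypothetical helper: extract the F statistic of a one-predictor aov model
f_stat <- function(formula, data) {
  summary(aov(formula, data = data))[[1]][["F value"]][1]
}

# The F statistic does not depend on which indicator is the response
f_element_on_severity <- f_stat(junction_num ~ severe_num, toy)
f_severity_on_element <- f_stat(severe_num ~ junction_num, toy)

# Chi-squared test of independence on the 2x2 table, without continuity correction
chi <- chisq.test(table(toy$severe_num, toy$junction_num), correct = FALSE)

# For two binary variables, the uncorrected chi-squared statistic equals n * r^2
r <- cor(toy$severe_num, toy$junction_num)

c(F_element = f_element_on_severity,
  F_severity = f_severity_on_element,
  chi_squared = unname(chi$statistic),
  n_r_squared = n * r^2)
```

The two F statistics agree, and the chi-squared statistic matches `n * r^2`, so all three approaches measure the same underlying association between the pair of indicators.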

Does the occurrence of the accident during a particular time of day or year affect the severity of the accident?

First, let’s look at some graphs to get a better idea.

The graph above shows the frequency of accidents in 2021 by month. The number of accidents increases toward the end of the year.

The graph above shows the frequency of accidents in 2021 by hour of the day. There is a spike in the frequency of accidents from the afternoon into the evening, probably because these are the peak traffic hours.
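The hour and month values behind these graphs come from the accident start time. As a minimal sketch, assuming the `Start_Time` string format shown in the `str()` output, they could be extracted as follows. This is not necessarily how the `Hour` and `Month` columns of `clean_acc21` were built, and the time zone `UTC` is an arbitrary choice for the example.

```r
# First four Start_Time values from the str() output of the raw data
start_time <- c("2016-02-08 00:37:08", "2016-02-08 05:56:20",
                "2016-02-08 06:15:39", "2016-02-08 06:51:45")

# Parse the strings; the time zone is an assumption for this illustration
parsed <- as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract hour of day (0-23) and month (1-12) as integers
hour <- as.integer(format(parsed, "%H"))
month <- as.integer(format(parsed, "%m"))

# Frequency of accidents per hour, keeping hours with zero accidents
hour_counts <- table(factor(hour, levels = 0:23))
hour_counts
```

Passing `levels = 0:23` keeps empty hours in the table, which matters when the counts are plotted as a bar chart over the full day.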

We ran a chi-squared test to check whether severity and hour of day are independent. We chose this test because both variables are categorical. H0: Severity and Hour are independent. H1: Severity and Hour are NOT independent.

test_Hour <- chisq.test(table(clean_acc21$Severity, clean_acc21$Hour))
test_Hour
## 
##  Pearson's Chi-squared test
## 
## data:  table(clean_acc21$Severity, clean_acc21$Hour)
## X-squared = 1641.7, df = 23, p-value < 2.2e-16

We can reject H0 because the p-value is less than 0.05, which suggests that severity and hour are not independent.

We ran a chi-squared test to check whether severity and month are independent. We chose this test because both variables are categorical. H0: Severity and Month are independent. H1: Severity and Month are NOT independent.

test_Month <- chisq.test(table(clean_acc21$Severity, clean_acc21$Month))
test_Month
## 
##  Pearson's Chi-squared test
## 
## data:  table(clean_acc21$Severity, clean_acc21$Month)
## X-squared = 443.62, df = 11, p-value < 2.2e-16

We can reject H0 because the p-value is less than 0.05, which suggests that severity and month are not independent.

We computed the Pearson correlation coefficient to gauge the strength of the relationship between severity and month.

cor(clean_acc21$Severity, clean_acc21$Month)
## [1] -0.007085495

Since the value is so close to zero, we can conclude that the relationship is weak.

We computed the Pearson correlation coefficient to gauge the strength of the relationship between severity and hour.

cor(clean_acc21$Severity, clean_acc21$Hour)
## [1] -0.002004478

Since the value is so close to zero, we can conclude that the relationship is weak.
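Pearson correlation treats hour and month as linear numeric scales, although both are cyclic categories. A complementary effect size that works directly from the chi-squared statistic is Cramér's V. The sketch below is an illustration only: `cramers_v` is a hypothetical helper written for this example, the toy data are synthetic, and the sample size of 1,423,121 is an assumption inferred from the ANOVA degrees of freedom (1 + 1,423,119 + 1), not a figure stated in the report.

```r
# Hypothetical helper: Cramér's V from a chi-squared statistic
cramers_v <- function(x2, n, n_rows, n_cols) {
  sqrt(x2 / (n * min(n_rows - 1, n_cols - 1)))
}

# Synthetic example: a binary severity flag that depends slightly on the hour
set.seed(1)
n <- 20000
hour <- sample(0:23, n, replace = TRUE)
severity <- rbinom(n, 1, ifelse(hour %in% 0:5, 0.08, 0.05))

tab <- table(severity, hour)
test <- chisq.test(tab)
v_toy <- cramers_v(unname(test$statistic), sum(tab), nrow(tab), ncol(tab))
v_toy

# Applied to the reported statistics; df = 23 and df = 11 imply two severity
# levels, and n is ASSUMED from the ANOVA degrees of freedom
n_assumed <- 1423121
v_hour <- cramers_v(1641.7, n_assumed, 2, 24)
v_month <- cramers_v(443.62, n_assumed, 2, 12)
c(hour = v_hour, month = v_month)
```

Under that assumed sample size, both values come out well below 0.1, which is consistent with the conclusion that the relationships are statistically significant but weak.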

Conclusion:

As mentioned earlier, our dataset reflects a real-world scenario, and few accidents affect traffic severely. In the small number of cases where they do, the following factors are involved.

Does weather affect the severity of traffic?
• Temperature, Wind Chill, Wind Speed, Humidity, Pressure, and Precipitation affect the severity of traffic.
• Wind Direction does not affect the severity of traffic much.
• Weather Conditions affect the severity of traffic.

Do Nearby Road Elements affect the severity of traffic?
• The nearby road elements Junction, Crossing, Stop, Station, and Traffic Signal affect the severity of traffic.

Does the occurrence of the accident during a particular time of day or year affect the severity of the accident?
• Both Hour and Month affect the severity, but the relationship is weak.